Abstract

The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

1  Introduction 

Parameterized string similarity models based on string edits have a long history (Levenshtein, 1966; Needleman & Wunsch, 1970; Sankoff & Kruskal, 1999). However, there are few methods for learning model parameters from training data, even though, as in other tasks, learning may lead to greater accuracy on real-world problems.

Ristad and Yianilos (1998) proposed an expectation-maximization-based method for learning string edit distance with a generative finite-state model. In their approach, training data consists of pairs of strings that should be considered similar, and the parameters are probabilities of certain edit operations. In the E-step, the highest probability edit sequence is found using the current parameters; in the M-step the probabilities are re-estimated using the expectations determined in the E-step so as to reduce the cost of the edit sequences expected to have caused the match. A useful attribute of this method is that the edit operations and parameters can be associated with states of a finite-state machine (with probabilities of edit operations depending on previous edit operations, as determined by the finite-state structure). However, as a generative model, this model cannot tractably incorporate arbitrary features of the input strings, and it cannot benefit from negative evidence from pairs of strings that (while partially overlapping) should be considered dissimilar.

Bilenko and Mooney (2003) extend Ristad’s model to include affine gaps, and also present a learned string similarity measure based on unordered bags of words, with training performed by an SVM. Cohen and Richman (2002) use a conditional maximum entropy classifier to learn weights on several sequence distance features. A survey of string edit distance measures is provided by Cohen et al. (2003). However, none of these methods combine the expressive power of a Markov model of edit operations with discriminative training.

This paper presents an undirected graphical model for string edit distance, and a conditional-probability parameter estimation method that exploits both matching and non-matching sequence pairs. Based on conditional random fields (CRFs), the approach not only provides powerful capabilities long sought in many application domains, but also demonstrates an interesting example of discriminative learning of a probabilistic model involving structured latent variables.

The training data consists of input string pairs, each associated with a binary label indicating whether the pair should be considered a “match” or a “mismatch.” Model parameters are estimated from both positive and negative examples, unlike in previous generative models (Ristad & Yianilos, 1998; Bilenko & Mooney, 2003). As in those models, however, it is not necessary to provide the desired edit operations or alignments: the alignments that enable the most accurate discrimination will be discovered automatically through an EM procedure. Thus this model is an example of an interesting class of graphical models that are trained conditionally, but have latent variables, and find the latent variable parameters that maximize discriminative performance. Another recent example is work on CRFs for object recognition from images (Quattoni et al., 2005).

The model is structured as a finite-state machine (FSM) with a single initial state and two disjoint sets of non-initial states with no transitions between them. State transitions are labeled by edit operations. One of the disjoint sets represents the match condition, the other the mismatch condition. Any non-empty transition path starting at the initial state defines an edit sequence that is wholly contained in either the match or mismatch subsets of the machine. By marginalizing out all the edit sequences in a subset, we obtain the probability of match or mismatch.

The cost of a transition is a function of its edit operation, the previous state, the new state, the two input strings, and the starting and ending position (the position of the match-so-far before and after performing this edit operation) for each of the two input strings. In applications, we take full advantage of this flexibility. For example, the cost function can examine portions of the input strings both before and after the current match position, it can examine domain knowledge, such as lexicons, or it can depend on rich conjunctions of more primitive features.

The flexibility of edit operations is possibly even more valuable. Edits can make arbitrarily-sized forward jumps in both input strings, and the size of the jumps can be conditioned on the input strings, the current match points in each, and the previous state of the finite-state process. For example, a single edit operation could match a three-letter acronym against its expansion in the other string by consuming three capitalized characters in the first string, and consuming three matching words in the second string. The cost of such an operation could be conditioned on the previous state of the finite-state process, as well as the appearance of the consumed strings in various lexicons, and the words following the acronym.
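To make the acronym example concrete, the following minimal Python sketch computes the size of such a jump. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `acronym_jump`, the word-tokenized representation of the second string, and the return convention are all hypothetical.

```python
def acronym_jump(x, i, y_words, w):
    """Return (characters consumed in x, words consumed in y) if the run of
    capital letters starting at x[i] spells the initials of the words starting
    at y_words[w]; return None if the edit does not apply.

    Hypothetical helper: the name and interface are illustrative only."""
    # Collect the run of consecutive uppercase letters starting at position i.
    end = i
    while end < len(x) and x[end].isupper():
        end += 1
    acronym = x[i:end]
    words = y_words[w:w + len(acronym)]
    if not acronym or len(words) < len(acronym):
        return None
    # Every acronym letter must equal the initial of the corresponding word.
    if all(word[:1].upper() == letter for letter, word in zip(acronym, words)):
        return len(acronym), len(words)
    return None


expansion = ["conditional", "random", "field", "model"]
print(acronym_jump("see CRF paper", 4, expansion, 0))
```

In a full model, the cost of taking this jump would be a learned function of further features (lexicon membership, the previous state), which this sketch omits.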

Inference and training in the model depend on a complex dynamic program in three dimensions. We employ various optimizations to speed learning.

We present experimental results on five standard text data sets, including short strings such as names and addresses, as well as longer, more complex strings, such as bibliographic citations. We show significant error reductions in all but one of the data sets.


2 Discriminatively Trained String Edit Distance

Let $x = x_1 \cdots x_m$ and $y = y_1 \cdots y_n$ be two strings or symbol sequences. This pair of input strings is associated with an output label $z \in \{0, 1\}$ indicating whether the strings should be considered a match (1) or a mismatch (0).¹ As we now explain, our model scores alignments between $x$ and $y$ as to whether they are a match or a mismatch. An alignment $a$ is a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states.

Let $a.e = e_1 \cdots e_k$ indicate the sequence of edit operations, such as delete-one-character-in-x, substitute-one-character-in-x-for-one-character-in-y, or delete-all-characters-in-x-up-to-its-next-nonalphabetic. Each edit operation $e_p$ in the sequence consumes either some of $x$ (deletion), some of $y$ (insertion), or some of both (substitution), up to positions $i_p$ in $x$ and $j_p$ in $y$. We therefore have corresponding non-decreasing sequences $a.i_x = i_1, \ldots, i_k$ and $a.i_y = j_1, \ldots, j_k$ of edit-operation positions for $x$ and $y$.

To classify alignments into matches or mismatches, we take edits as transition labels for a non-deterministic FSM with state set $S = \{q_0\} \cup S_0 \cup S_1$. There are transitions from the initial state $q_0$ to states in the disjoint sets $S_0$ and $S_1$, but no transitions between those two sets. In addition to the edit sequence and string position sequences, we associate the alignment $a$ with a sequence of consecutive destination states $a.q = q_1 \cdots q_k$, where $e_p$ labels an allowed transition from $q_{p-1}$ to $q_p$. By construction, either $a.q \subseteq S_0$ or $a.q \subseteq S_1$. Alignments with states in $S_1$ are supposed to represent matches, while alignments with states in $S_0$ are supposed to represent mismatches.

In summary, an alignment is specified by the four-tuple $a = (a.e = e_1 \cdots e_k,\ a.i_x = i_1 \cdots i_k,\ a.i_y = j_1 \cdots j_k,\ a.q = q_1 \cdots q_k)$. For convenience, we also write $a = a_0, a_1 \cdots a_k$ with $a_p = (e_p, i_p, j_p, q_p)$, $1 \le p \le k$, and $a_0 = (-, 0, 0, q_0)$, where $-$ is a dummy initial edit.

Given two strings $x$ and $y$, our discriminative string edit CRF defines the probability of an alignment between $x$ and $y$ as

$$p(a \mid x, y) = \frac{1}{Z_{x,y}} \prod_{i=1}^{|a|} \Phi(a_{i-1}, a_i, x, y),$$

where the potential function $\Phi(\cdot)$ is a non-negative function of its arguments, and $Z_{x,y}$ is the normalizer (partition function). In our experiments we parameterize these potential functions as an exponential of a linear scoring function

$$\Phi(a_{i-1}, a_i, x, y) = \exp\big(\Lambda \cdot f(a_{i-1}, a_i, x, y)\big),$$

where $f$ is a vector of feature functions, each taking as arguments two consecutive states in the alignment sequence, the corresponding edits, and their string positions, which allow the feature functions to depend on the context of $a_i$ in $x$ and $y$. A typical feature function combines some predicate on the input, or input feature, with a predicate over the alignment itself (edit operation, states, positions).

¹ One could also straightforwardly imagine a different regression-based scenario in which $z$ is real-valued, or a ranking-based criterion, in which two pairs are provided and $z$ indicates which pair of strings should be considered closer.

To obtain the probability of match given simply the input strings, we marginalize over all alignments in the corresponding state set:

$$p(z \mid x, y) = \sum_{a \,:\, a.q \subseteq S_z} \frac{1}{Z_{x,y}} \prod_{i=1}^{|a|} \Phi(a_{i-1}, a_i, x, y).$$

Fortunately, this sum can be calculated efficiently by dynamic programming. Typically, for any given edit operation, starting positions and input strings, there are a small number of possible resulting ending positions. Max-product (Viterbi-like) inference can also be performed efficiently.
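The marginalization can be sketched as a forward recursion over string positions and FSM states. The Python below is a minimal illustration, assuming only the three single-character edits and an arbitrary user-supplied potential; the names `OPS`, `class_score`, `p_match`, and `toy_phi`, and the toy weights, are hypothetical and are not taken from the implementation described later. It works with raw products for readability; a practical version would work in log space.

```python
import math

# Edit operations and how many characters each consumes in x and in y.
OPS = {"delete": (1, 0), "insert": (0, 1), "substitute": (1, 1)}


def class_score(x, y, states, phi, q0="q0"):
    """Sum over all non-empty alignments whose states lie in `states` of the
    product of potentials (forward algorithm over i, j and FSM state)."""
    m, n = len(x), len(y)
    # alpha[i][j][q]: total score of partial alignments that have consumed
    # x[:i] and y[:j] and whose last edit ended in state q.
    alpha = [[dict.fromkeys(states, 0.0) for _ in range(n + 1)]
             for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            for op, (di, dj) in OPS.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                for q in states:
                    if pi == 0 and pj == 0:
                        # First edit of an alignment: transition out of q0.
                        alpha[i][j][q] += phi(q0, q, op, pi, pj, i, j, x, y)
                    for prev in states:
                        alpha[i][j][q] += (alpha[pi][pj][prev]
                                           * phi(prev, q, op, pi, pj, i, j, x, y))
    return sum(alpha[m][n].values())


def p_match(x, y, S0, S1, phi):
    """p(z = 1 | x, y); assumes at least one of x, y is non-empty."""
    z1 = class_score(x, y, S1, phi)
    z0 = class_score(x, y, S0, phi)
    return z1 / (z0 + z1)  # the normalizer sums over both state subsets


def toy_phi(prev, q, op, pi, pj, i, j, x, y):
    """Toy potential with hand-picked weights (an assumption, not learned)."""
    same = op == "substitute" and x[pi] == y[pj]
    if q == "m":  # match state: reward copies, penalize every other edit
        score = 1.0 if same else -2.0
    else:         # mismatch state: mildly penalize every edit
        score = -0.5
    return math.exp(score)


print(p_match("kitten", "kitten", ["n"], ["m"], toy_phi))
print(p_match("kitten", "xyzzyq", ["n"], ["m"], toy_phi))
```

Because every alignment lies wholly in one of the two state subsets, the normalizer is simply the sum of the two class scores, which is what `p_match` relies on.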

3  Parameter  Estimation 

Parameters are estimated by penalized maximum likelihood on a set of training data. Training data consists of a set of $N$ string pairs $(x^{(j)}, y^{(j)})$ with corresponding labels $z^{(j)} \in \{0, 1\}$, indicating whether or not the pair is a match. We use a zero-mean spherical Gaussian prior $\sum_k \lambda_k^2 / \sigma^2$ for penalization.

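The incomplete (non-penalized) log-likelihood is then

$$\mathcal{L}_I = \sum_j \log p\big(z^{(j)} \mid x^{(j)}, y^{(j)}\big)$$

and the complete log-likelihood is

$$\mathcal{L}_C = \sum_j \sum_a \log\Big(p\big(z^{(j)} \mid a, x^{(j)}, y^{(j)}\big)\, p\big(a \mid x^{(j)}, y^{(j)}\big)\Big).$$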

We maximize this likelihood with EM, estimating $p(a \mid x^{(j)}, y^{(j)})$ given current parameters $\Lambda$ in the E-step, and maximizing the complete penalized log-likelihood in the M-step. For optimization in the M-step we use BFGS. Unlike CRFs without latent variables, the objective function has local maxima. To avoid getting stuck in poor local maxima, the parameters are initialized to yield a reasonable default edit distance.


Dynamic programming for this model fills a three-dimensional table (two dimensions for the two input strings, and one for the states in $S$). The table can be moderately large in practice ($n = m = 100$ and $|S| = 12$, resulting in 120,000 entries), and beam search may effectively be used to increase speed, just as in speech recognition, where even larger tables are common.

It is interesting to examine what alignments will be learned in $S_0$, the non-match portion of the model. To attain high accuracy, these states should attract string pairs that are dissimilar. But even similar strings have bad alignments, for example the alignment that first deletes all of $x$, and then inserts all of $y$. Fortunately, finding how dissimilar two strings are requires finding as good an alignment as is possible, and then deciding that this alignment is not very good. These as-good-as-possible alignments are exactly what our learning procedure discovers: driven by an objective function that aims to maximize the likelihood of the correct binary match/non-match labels, the model finds the latent alignment paths that enable it to maximize this likelihood.

This model thus falls in a family of interesting techniques involving discrimination among complex structured objects, in which the structure or relationship among the parts is unknown (latent), and the latent choice has high impact on the discrimination task. Similar considerations are at the core of discriminative non-probabilistic methods for structured problems such as handwriting recognition (LeCun et al., 1998) and speech recognition (Woodland & Povey, 2002), and, more recently, computer vision object recognition (Quattoni et al., 2005). We discuss related work further in Section 6.

4  Implementation 

The model has been implemented as part of the finite-state transducer classes in Mallet (McCallum, 2002). We map three-dimensional dynamic programming problems over positions in $x$ and $y$ and states $S$ to Mallet’s existing finite-state forward-backward and Viterbi implementations by encoding the two position indices into a single index in a diagonal crossing pattern that starts at (0,0). For example, a single-character delete operation, which would be a hop to an adjacent vertical or horizontal cell in the original table, is a longer, one-dimensional (but deterministically-calculated) jump in the encoding.
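One way to realize such a diagonal encoding is the classical Cantor pairing, which numbers table cells along successive anti-diagonals starting at (0,0). The Python sketch below is an assumption about what the encoding could look like, not a description of Mallet's actual code; the function names `diag_index` and `diag_positions` are hypothetical.

```python
import math


def diag_index(i, j):
    """Map table cell (i, j) to a single index by walking anti-diagonals
    d = i + j in order, starting at (0, 0)."""
    d = i + j
    return d * (d + 1) // 2 + i


def diag_positions(k):
    """Inverse of diag_index: recover (i, j) from the single index k."""
    d = (math.isqrt(8 * k + 1) - 1) // 2  # anti-diagonal that contains k
    i = k - d * (d + 1) // 2
    return i, d - i


# A one-character delete moves from (i, j) to (i + 1, j). In the encoding this
# is a jump of i + j + 2, which depends only on the current position.
i, j = 3, 5
print(diag_index(i + 1, j) - diag_index(i, j))
```

Under this particular encoding an insert from (i, j) to (i, j + 1) is a jump of i + j + 1 and a substitute is a jump of 2(i + j) + 4, so every standard edit remains a deterministic function of the current cell.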

In addition to the standard edit operations (insertion, deletion, substitution), we also have more powerful edits that fit naturally into this model, such as delete-until-end-of-word, delete-word-in-lexicon, and delete-word-appearing-in-other-string.


5  Experimental  Results 

We show experimental results on one synthetic and six real-world data sets, all of which have been used in previous work evaluating string edit measures. The first two data sets are the name and address fields of the Restaurant database. Among its 864 records, 112 are matches. The last four data sets are citation strings from the standard Reasoning, Constraint, Reinforcement and Face sections of the CiteSeer data. The ratios of citations to unique papers for these are 514/196, 349/242, 406/148 and 295/199 respectively. Making the problem more challenging than certain other evaluations on these data sets, our strings are not segmented into fields such as title or author, but are each treated as a single unsegmented character sequence. We also present results on synthetic noise on person names, generated by the UIS Database generator. This program produces perturbed names according to modifiable noise parameters, including the probability of an error anywhere in a record, the probability of single character insertion, deletion or swap, and the probability of a word swap.

5.1  Edit  Operations  and  Features 

One of the main advantages of our model is the ability to include non-independent input features and extremely flexible edit operations. The input features used in our experiments include subsets of the following, described as acting on cell $i, j$ in the dynamic programming table and the two input strings $x$ and $y$.

• same, different: $x_i$ and $y_j$ match (do not match);

• same-alphabetic, different-alphabetic: $x_i$ and $y_j$ are alphabetic and they match (do not match);

• same-numeric, different-numeric: $x_i$ and $y_j$ are numeric and they match (do not match);

• punctuation-x, punctuation-y: $x_i$ and $y_j$ are punctuation, respectively;

• alphabet-mismatch, number-mismatch: one of $x_i$ and $y_j$ is alphabetic (numeric), the other is not;

• end-of-x, end-of-y: $i = |x|$ ($j = |y|$);

• same-next-character, different-next-character: $x_{i+1}$ and $y_{j+1}$ match (do not match).
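The input features listed above can be read as simple predicates on a table cell. The Python sketch below is one possible reading, under the assumptions that positions are 1-based as in the text, that "numeric" means a digit character, and that "punctuation" means ASCII punctuation; the function name `input_features` is hypothetical.

```python
import string


def input_features(x, y, i, j):
    """Return the names of the input features that fire at cell (i, j).
    Positions are 1-based, so x_i is x[i - 1]."""
    xi, yj = x[i - 1], y[j - 1]
    same = xi == yj
    feats = {"same" if same else "different"}
    if xi.isalpha() and yj.isalpha():
        feats.add("same-alphabetic" if same else "different-alphabetic")
    if xi.isdigit() and yj.isdigit():
        feats.add("same-numeric" if same else "different-numeric")
    if xi in string.punctuation:
        feats.add("punctuation-x")
    if yj in string.punctuation:
        feats.add("punctuation-y")
    if xi.isalpha() != yj.isalpha():
        feats.add("alphabet-mismatch")
    if xi.isdigit() != yj.isdigit():
        feats.add("number-mismatch")
    if i == len(x):
        feats.add("end-of-x")
    if j == len(y):
        feats.add("end-of-y")
    if i < len(x) and j < len(y):
        # x[i] and y[j] are the characters x_{i+1} and y_{j+1} of the text.
        feats.add("same-next-character" if x[i] == y[j]
                  else "different-next-character")
    return feats


print(sorted(input_features("a1", "a2", 1, 1)))
```

A feature function in the model would then conjoin one of these input features with a predicate on the alignment, such as the edit operation or the destination state.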

Edit  operations  on  FSM  transitions  include: 

• Standard string edit operations: insert, delete and substitute.

• Two character operations: swap-two-characters.

• Word skip operations: skip-if-word-in-lexicon, skip-word-if-present-in-other-string, skip-parenthesized-words and skip-any-word.

• Operations for handling acronyms and abbreviations by inserting, deleting, or substituting specific types of substrings.

Learned parameters are associated with the input features as well as with state transitions in the FSM. All transitions entering a state may share tied parameters (first order), or have different parameters (second order). Since the FSM can have more states than edit operations, it can remember the context of previous edit actions.

5.2  Experimental  Methodology 

Our model exploits both positive and negative examples during training. Positive training examples include all pairs of strings referring to the same object (the matching strings). However, the total number of negative examples is quadratic in the number of objects. Due to both time and memory constraints, as well as a desire to avoid overwhelming the positive training examples, we sample the negative (mismatch) string pairs so as to attain a 1:10 ratio of match to mismatch pairs. In order to preferentially sample “near misses” we filter negative examples in one of two ways:

• Remove negative examples that are too dissimilar according to a suitable metric. For the CiteSeer datasets we use the cosine metric to measure similarity of two citations; for other datasets we use the metric of Jaro (1989).

• Select the best matching negative pairs according to a CRF with parameters set by hand to reasonable values.
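The first filtering strategy can be sketched as follows. This Python example is only an illustration: the whitespace tokenization, the similarity threshold `min_sim`, the fixed random seed, and the function names `cosine` and `sample_near_misses` are all assumptions, since the text does not specify them.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def sample_near_misses(negatives, n_positives, ratio=10, min_sim=0.2, seed=0):
    """Drop negative pairs that are too dissimilar, then sample at most
    ratio * n_positives of the remaining "near misses"."""
    near = [pair for pair in negatives if cosine(*pair) >= min_sim]
    k = min(len(near), ratio * n_positives)
    return random.Random(seed).sample(near, k)


negatives = [
    ("learning string edit distance", "learning string kernels"),
    ("learning string edit distance", "face recognition survey"),
]
print(sample_near_misses(negatives, n_positives=1))
```

With `ratio=10` the sample respects the 1:10 match-to-mismatch ratio described above whenever enough near misses survive the filter.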

As in Bilenko and Mooney (2003), we use a 50/50 train/test split of the data, and repeat the process with the folds interchanged. With the restaurant name and restaurant address datasets, we run our algorithm with different choices of features and states, and 4 random splits of the data. With the CiteSeer datasets, we have results for two random splits of the data.

To give EM training a reasonable starting point, we hand-set the initial parameters to somewhat arbitrary, yet reasonable values. (Of course, hand-setting of string edit parameters is the standard for all the non-learning approaches.) We examined a small held-out set of data to verify that these initial parameters were reasonable. We set the parameters on the match portion of the FSM to provide good alignments; we then copy these parameters to the mismatch portion of the model, offsetting them by bringing all values closer to zero by a small constant.
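The copy-and-offset initialization can be illustrated in a few lines of Python. The parameter names, the dictionary representation, the offset value, and the clamping of small weights to zero are all hypothetical choices made for this sketch.

```python
def init_mismatch(match_params, offset=0.1):
    """Copy the match-side parameters to the mismatch side, moving every
    value closer to zero by `offset` (values within `offset` become 0)."""
    mismatch = {}
    for name, weight in match_params.items():
        if abs(weight) <= offset:
            mismatch[name] = 0.0
        elif weight > 0:
            mismatch[name] = weight - offset
        else:
            mismatch[name] = weight + offset
    return mismatch


# Hand-set starting weights for the match portion (illustrative values only).
match_params = {"same": 2.0, "different": -1.5, "insert": -1.0}
print(init_mismatch(match_params))
```

Shrinking the mismatch weights toward zero gives the two portions of the model slightly different starting scores for the same alignment, so EM does not begin from a perfectly symmetric point.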


Distance Metric         Restaurant  Restaurant  Reasoning  Face    Reinforcement  Constraint
                        name        address
Edit Distance           0.290       0.686       0.927      0.952   0.893          0.924
Learned Edit Distance   0.354       0.712       0.938      0.966*  0.907          0.941
Vector-space            0.365       0.380       0.897      0.922   0.903          0.923
Learned Vector-space    0.433       0.532       0.924      0.875   0.808          0.913
CRF Edit Distance       0.448*      0.783*      0.964*     0.918   0.917*         0.976*

Table 1: Averaged F-measure for detecting matching field values on several standard data sets (* indicates highest F1). The top four rows are results duplicated from Bilenko and Mooney (2003); the bottom row is the performance of the CRF method introduced in this paper.


Lexicons were populated automatically by gathering the most frequent words in the training set. (Alternatively one could imagine lexicon feature values set to inverse-document-frequency values, or similar information retrieval metrics.) In some cases, before training, lexicons were edited to remove author surnames.

The equations in Section 3 are used to calculate $p(z \mid x, y)$, with a first-order model. A threshold of 0.5 predicts whether the string pair is a match or a mismatch. (Note that alternative thresholds could easily be used to trade off precision and recall, and that CRFs are typically good at predicting calibrated posterior probabilities needed for such tuning as well as accuracy/coverage curves.) Bilenko and Mooney (2003) found transitive closure to improve F1, and use it for their results; we did not find it to help, and do not use it.

Precision  is  calculated  to  be  the  ratio  of  the  number 
of  correctly  classified  duplicates  to  the  total  number 
of  duplicates  identified.  Recall  is  the  ratio  of  correctly 
classified  duplicates  to  the  total  number  of  duplicates 
in  the  dataset.  We  report  the  mean  performance  across 
multiple  random  splits. 
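For reference, the precision and recall definitions above, combined with the usual definition of F1 as their harmonic mean, can be computed as in the following Python sketch. The function name `precision_recall_f1` and the representation of duplicates as sets of record-id pairs are assumptions made for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 for duplicate detection.

    predicted: pairs the classifier labeled as duplicates
    gold:      pairs that are true duplicates in the dataset"""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {(1, 2), (3, 4), (5, 6)}
gold = {(1, 2), (3, 4), (7, 8), (9, 10)}
print(precision_recall_f1(predicted, gold))
```

In this toy case two of three predicted pairs are correct and two of four true duplicates are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.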

5.3  Results 

In experiments on the six real-world data sets we compare our performance against results in a recent benchmark paper by Bilenko and Mooney (2003); Bilenko recently completed thesis work in this area. These results are summarized in Table 1, where the top four rows are duplicated from Bilenko and Mooney (2003), and the bottom row shows the results of our method. The entries are the average F1 measure across the folds. We observe large performance improvements on most datasets. The fact that the difference in performance across our trials is typically around 0.01 suggests strong statistical significance. Our average F1 on the Face dataset was 0.04 less than the previous best. The examples on which we made errors generally had a large venue, authors, or URL field in one string but not in the other.

We also evaluate the effect on performance of using Viterbi (max-product) inference in training instead of forward-backward (sum-product) inference. Except for the restaurant address dataset, forward-backward performs significantly better than Viterbi on all datasets. The restaurant address data set contains positive examples with a large unmatched suffix in one of the strings, which may lead to an inappropriate dilution of probability amongst many alignments. Average F1 measures for the restaurant datasets using Viterbi and forward-backward are shown in Table 2. All results shown in Table 1 use forward-backward probabilities.

Dataset  Viterbi  Forward-Backward
Restaurant name
Restaurant address
0.689
0.708
0.720
0.651

Table 2: Averaged F-measures for Viterbi vs. forward-backward (trained and evaluated on a subset of the data; a smaller test set yields higher accuracy).

In  the  other  tables  we  present  results  showing  the  im¬ 
pact  of  various  edit  operations  and  features. 

Table 3 shows F1 on the restaurant data set as various edit operations are added to the model: i denotes insert, d denotes delete, s denotes substitute, paren denotes skip-parenthesized-word, lex denotes skip-if-word-in-lexicon, and pres denotes skip-word-if-present-in-other-string. All use the same-alphabets and different-alphabets input features. As can be seen from the results, adding “skip” edits improves performance. Although skip-parenthesized-words gives better results on the smaller data set used for the experiments in the table, skip-if-word-in-lexicon produces a higher accuracy on larger data sets, because of peculiarities in how restaurants with the same name and different locations are named in the data set. We also see that a second-order model performs less well, presumably because of data sparseness.

Table 3: Averaged maximum F-measure for different state combinations on a subset of restaurant name (trained and evaluated on the same train/test split).

Table 4 shows the benefits of including various features for the restaurant address data set, while fixing the edit operations (insert, delete and substitute). In the table, s and d denote the same and different features, salp and dalp stand for the same-alphabets and different-alphabets features, and snum and dnum stand for the same-numbers and different-numbers features. The s and d features are different from the salp, dalp, snum, and dnum features in that the weights learned for the former depend only on whether the two characters are equal or not, and no separate weights are learned for a number match or a letter match. We conjecture that a number mismatch in the address data needs to be penalized more than a letter mismatch. Separating the same and different features into features for letters and numbers reduces the error from about 6% to 3%.

s, d                      0.944
salp, dalp, snum, dnum    0.973

Table 4: Averaged maximum F1-measure for different feature combinations on a subset of the restaurant address data set.

Finally, Table 5 demonstrates the power of CRFs to include extremely flexible edit operations that examine arbitrary pieces of the two input strings. In particular we measure the impact of including the skip-word-if-present-in-other-string operation (“skip” for short). Here we train and test on the UIS synthetic name data, in which the error probability is 40%, the typo error probability is 40% and the swap first and last name probability is 50% (the rest of the parameters were unchanged from the default values). The difference in performance is dramatic, bringing error down from about 14% to less than 2%. Of course, arbitrary substring swaps are not expressible in standard dynamic programs, but the skip operation gives an excellent approximation while preserving efficient finite-state inference. Typical improved alignments with the new operation may skip over a matching swapped first name, and then proceed to correct individual typographic errors in the last name.

Without skip    0.856
With skip       0.981

Table 5: Average maximum F-measure for synthetic name dataset with and without skip-if-present-in-other-string state.

Table 6: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold.

Table 7: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold.

An example alignment found by our model on restaurant name is shown in Table 7. As discussed in Section 3, the mismatch portion of the model indeed learns the best possible latent alignments in order to measure distance with the most salient features. This example’s alignment score from the match portion is higher. The entries in the dynamic programming table i, d, s, l, and p correspond to states reached by the operations insert, delete, substitute, skip-word-in-lexicon, and skip-parenthesized-word respectively. The symbol - denotes a null transition.

6  Related  Work 

String (dis)similarity metrics based on edit distance are widely used in applications ranging from approximate matching and duplicate removal in database records to identifying conserved regions in comparative genomics. Levenshtein (1966) introduced least-cost editing based on independent symbol insertion, deletion, and substitution costs, and Needleman and Wunsch (1970) extended the method to allow gaps. Editing between strings over the same alphabet can be generalized to transduction between strings in different alphabets, for instance in letter-to-sound mappings (Riley & Ljolje, 1996) and in speech recognition (Jelinek et al., 1975).

In most applications, the edit distance model is derived by heuristic means, possibly including some data-dependent tuning of parameters. For example, Monge and Elkan (1997) recognize duplicate corrupted records using an edit distance with tunable edit and gap costs. Hernandez and Stolfo (1995) merge records in large databases using rules based on domain-specific edit distances for duplicate detection. Cohen (2000) uses a token-based TF-IDF string similarity score to compute ranked approximate joins on tables derived from Web pages. Koh et al. (2004) use association rule mining to check for duplicate records with per-field exact, Levenshtein or BLAST 2 gapped alignment (Altschul et al., 1997) matching. Cohen et al. (2003) survey edit and common substring similarity metrics for name and record matching, and their application in various duplicate detection tasks.

In bioinformatics, sequence alignment with edit costs
based on evolutionary or biochemical estimates is
common (Durbin et al., 1998). Position-independent
costs are normally used for general sequence
similarity search, but position-dependent costs are
often used when searching for specific sequence motifs.

In basic edit distance, the cost of individual edit
operations is independent of the string context.
However, applications often require edit costs to
change depending on context. For instance, the
characters in an author’s first name after the first
character are more likely to be deleted than the first
character. Rather than using specialized
representations and dynamic programming algorithms, we
can represent context-dependent editing with weighted
finite-state transducers (Eilenberg, 1974; Mohri
et al., 2000) whose states represent different types
of editing contexts. The same idea has also been
expressed with pair hidden Markov models for pairwise
biological sequence alignment (Durbin et al., 1998).
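
The first-name example can be illustrated with a deliberately simplified sketch in which the deletion cost depends on where the character sits in the source string. This is not the weighted finite-state transducer formulation described above; it only shows how a context-dependent cost changes the result. The function name and all cost values (1.0 for word-initial deletions, 0.2 for later ones) are hypothetical.

```python
def contextual_edit_distance(source, target):
    """Edit distance whose deletion cost depends on position within a word.

    All costs are hypothetical values chosen only for illustration.
    """
    insert_cost = 1.0
    substitute_cost = 1.0

    def delete_cost(position):
        # Dropping the first character of a word is expensive;
        # dropping any later character (as in an abbreviation) is cheap.
        word_initial = position == 0 or source[position - 1] == " "
        return 1.0 if word_initial else 0.2

    n, m = len(source), len(target)
    # table[i][j] = cheapest cost of editing source[:i] into target[:j]
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + delete_cost(i - 1)
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + delete_cost(i - 1),
                table[i][j - 1] + insert_cost,
                table[i - 1][j - 1] + (0.0 if same else substitute_cost),
            )
    return table[n][m]
```

Under these assumed costs, abbreviating "john" to "j" is cheaper than reducing it to "n", because the latter requires deleting the word-initial character.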

If edit costs are identified with negative log
probabilities (up to normalization), edit distance
models and certain weighted transducers can be
interpreted as generative models for pairs of
sequences. Pair HMMs are such generative models by
definition. Therefore, expectation-maximization using
an appropriate version of the forward-backward
algorithm can be used to learn parameters that
maximize the likelihood of a given training set of
pairs of strings according to the generative model
(Ristad & Yianilos, 1998; Ristad & Yianilos, 1996;
Durbin et al., 1998). Bilenko and Mooney (2003) use EM
to train the probabilities in a simple edit transducer
for one of the duplicate detection measures they
evaluate. Eisner (2002) gives a general algorithm for
learning weights for transducers, and notes that the
approach applies to transducers with transition scores
given by globally normalized log-linear models. These
models are to CRFs as pair HMMs are to HMMs.
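
The correspondence between costs and probabilities can be written out as a short worked identity. The notation here is an assumption introduced for illustration: $p(e)$ is the probability of edit operation $e$, $c(e)$ its cost, and $A(x, y)$ the set of edit sequences that turn string $x$ into string $y$; normalization and termination terms are ignored.

```latex
\[
c(e) = -\log p(e)
\quad\Longrightarrow\quad
\sum_{e \in a} c(e) = -\log \prod_{e \in a} p(e)
\qquad \text{for any edit sequence } a,
\]
\[
d(x, y) \;=\; \min_{a \in A(x, y)} \sum_{e \in a} c(e)
\;=\; -\log \max_{a \in A(x, y)} \prod_{e \in a} p(e),
\qquad
p(x, y) \;=\; \sum_{a \in A(x, y)} \prod_{e \in a} p(e).
\]
```

Read this way, the least-cost edit sequence is the most probable one, while the likelihood that EM maximizes sums over all edit sequences in $A(x, y)$, which is what the forward-backward computation provides.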

The foregoing methods for training edit transducers or
pair HMMs use positive examples alone, but do not need
to be given explicit alignments because they do EM
with alignment as a latent (structured) variable.
Joachims (2003) gives a generic maximum-margin method
for learning to score alignments from positive and
negative examples, but the training examples must
include the actual alignments. In addition, he cannot
solve the problem exactly because he does not exploit
factorizations of the problem that yield a polynomial
number of constraints and efficient dynamic
programming search over alignments.

While the basic models and algorithms are expressed in
terms of single-letter edits, in practice it is
convenient to use a richer application-specific set of
edit operations, such as name abbreviation. For
example, Brill and Moore (2000) use edit operations
designed for spelling correction in a spelling
correction model trained by EM. Tejada et al. (2001)
have edit operations such as abbreviation and acronym
for record linkage.

7  Conclusions 

We have presented a new discriminative model for
learning finite-state edit distance from positive and
negative examples consisting of matching and
non-matching strings. It is not necessary to provide
sequence alignments during training. Experimental
results show the method to outperform previous
approaches.

The model is an interesting member of a family of
models that use a discriminative objective function
to discover latent structure. The latent edit
operation sequences that are learned by EM are indeed
the alignments that help discriminate matching from
non-matching strings.

We have described in some detail the finite-state
version of this model. A context-free grammar version
of the model could, through edit operations defined on
trees, handle swaps of arbitrarily sized substrings.